Sincerity of Quora Questions

Joaquín Cruz, Nicolás Fierro, Gabriel Norambuena, Matilde Rivas

Introduction

Quora is an online platform that lets users ask and answer questions on a wide variety of topics. One problem Quora faces is detecting and removing questions considered "insincere": questions founded on false premises, intended to make disparaging statements, or containing hostile remarks toward a group of people.

We aim to build a representative characterization of the questions and develop a model that identifies insincere ones. The goal of this work is to find patterns in the questions already labeled as insincere and determine whether clear differences can be established between insincere and sincere questions.

Initial description of the dataset

We have a dataset of 1,306,122 questions for training the classifier. Each record contains the fields:

  • qid: the question's identifier in the system
  • question_text: the text of the question
  • target: the label indicating whether the question is insincere or not

Note that the labels may contain noise: some questions that are insincere by definition may not be marked as such.

We consider a question insincere if it:

  • Has an exaggerated tone about a particular group of people
  • Is rhetorical and merely tries to make a statement about a group of people
  • Suggests a discriminatory idea about a class of people, or seeks to confirm stereotypes about them
  • Attacks or insults a specific person or a group of people
  • Is based on an outlandish premise about a group of people
  • Criticizes people for a characteristic that cannot be improved or measured
  • Is based on false information or an absurd premise
  • Uses sexual content (such as incest or pedophilia) to provoke outrage rather than to seek a genuine answer

Central Problem

The central problem of this work is to automate the labeling of these questions; specifically, to build a classifier that, given a question, indicates whether it is sincere or not.

Milestone I

Methodology for approaching the problem

To solve the problem, we first need to explore the data, characterize each question, and classify questions by their content. During the data exploration we will:

  • Analyze how much data there is.
  • Evaluate the characteristics of the questions of each type; for this we extract n-grams (of one, two or three words) to examine the structure of the questions.
  • Generate word clouds for each type to see the most frequently used words.
  • Find the words common to both subsets, to gauge how much noise there is in the dataset.
  • Study the metadata of both categories of the dataset.

Data

The dataset contains the text of each Quora question, its label, and a unique identifier. There are 1,306,122 labeled samples, of which 1,225,312 correspond to sincere questions and 80,810 to insincere ones.

In [1]:
# Download the data from Kaggle
!pip install kaggle
!echo '{"username":"gnorambuena","key":"2987173ef6bc5124b61c4805ff3c958c"}' > kaggle.json
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle competitions download -c quora-insincere-questions-classification
Downloading train.csv.zip to /content
 61% 33.0M/54.4M [00:00<00:00, 27.8MB/s]
100% 54.4M/54.4M [00:00<00:00, 59.0MB/s]
Downloading embeddings.zip to /content
100% 5.94G/5.96G [01:18<00:00, 70.2MB/s]
100% 5.96G/5.96G [01:18<00:00, 81.6MB/s]
Downloading sample_submission.csv.zip to /content
100% 4.08M/4.08M [00:00<00:00, 32.2MB/s]

Downloading test.csv.zip to /content
 57% 9.00M/15.7M [00:00<00:00, 13.5MB/s]
100% 15.7M/15.7M [00:00<00:00, 19.0MB/s]
In [2]:
# Unzip the downloaded files
!unzip train.csv.zip
!unzip test.csv.zip
!unzip embeddings.zip
!ls
Archive:  train.csv.zip
  inflating: train.csv               
Archive:  test.csv.zip
  inflating: test.csv                
Archive:  embeddings.zip
   creating: GoogleNews-vectors-negative300/
   creating: glove.840B.300d/
   creating: paragram_300_sl999/
   creating: wiki-news-300d-1M/
  inflating: glove.840B.300d/glove.840B.300d.txt  
  inflating: GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin  
  inflating: wiki-news-300d-1M/wiki-news-300d-1M.vec  
  inflating: paragram_300_sl999/README.txt  
  inflating: paragram_300_sl999/paragram_300_sl999.txt  
embeddings.zip			sample_data		   train.csv
glove.840B.300d			sample_submission.csv.zip  train.csv.zip
GoogleNews-vectors-negative300	test.csv		   wiki-news-300d-1M
paragram_300_sl999		test.csv.zip

Below is a sample of what the training dataset contains:

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
train= pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

ds_insincere = train[train.target==1]

ds_sincere = train[train.target==0]


print("Preguntas insinceras: \n {} \n {} \n {} \n {} \n {} \n".format(*list(ds_insincere[:5]["question_text"])))
print("Preguntas sinceras: \n {} \n {} \n {} \n {} \n {} \n".format(*list(ds_sincere[:5]["question_text"])))
Preguntas insinceras: 
 Has the United States become the largest dictatorship in the world? 
 Which babies are more sweeter to their parents? Dark skin babies or light skin babies? 
 If blacks support school choice and mandatory sentencing for criminals why don't they vote Republican? 
 I am gay boy and I love my cousin (boy). He is sexy, but I dont know what to do. He is hot, and I want to see his di**. What should I do? 
 Which races have the smallest penis? 

Preguntas sinceras: 
 How did Quebec nationalists see their province as a nation in the 1960s? 
 Do you have an adopted dog, how would you encourage people to adopt and not shop? 
 Why does velocity affect time? Does velocity affect space geometry? 
 How did Otto von Guericke used the Magdeburg hemispheres? 
 Can I convert montra helicon D to a mountain bike by just changing the tyres? 

The following bar chart shows that the number of sincere questions is considerably larger than the number of insincere ones.

In [4]:
fig= plt.figure(figsize=(40,20))
plt.subplot2grid((2,3),(0,0))
train.target.value_counts().plot(kind="bar",alpha=0.4)
bars = ['sinceras', 'insinceras']
y_pos = [i for i, _ in enumerate(bars)]
plt.xticks(y_pos, bars)
plt.title("Tipo de preguntas en el dataset de entrenamiento")
plt.ylabel("Cantidad de preguntas")
plt.xlabel("Tipo de pregunta")

plt.show()

Wordclouds

To get a general idea of the questions' content, we generated word clouds showing the most frequent words.

In [0]:
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image
import numpy as np

def generate_wordcloud(data, mask_file = None, transform_mask = False):
    if mask_file is None:
      wordcloud = WordCloud(background_color="white",
                            random_state=40,
                            mode="RGB",
                            min_font_size=6,
                            max_words=150,
                            width = 300,
                            height = 300)
      wordcloud.generate(data)
      plt.figure(figsize=[10,10])
      plt.imshow(wordcloud,interpolation="bilinear")
    else:
      mask = np.array(Image.open(mask_file))
      image_colors = ImageColorGenerator(mask)
      wordcloud = WordCloud(
          background_color='white',
          max_words=150,
          max_font_size=50,
          random_state=40,
          mask = mask,
          width = 300,
          height = 300
      )
      wordcloud.generate(data)
      plt.imshow(wordcloud.recolor(random_state=3, color_func=image_colors), interpolation='bilinear')
    plt.axis("off")
    plt.show()

Below is the word cloud generated from all of the questions:

In [6]:
all_words = " ".join(list(train["question_text"]))

generate_wordcloud(all_words)

When the word cloud is generated only from the insincere questions, the words shown change significantly. This indicates that insincere questions have recurring topics that are not shared by the questions as a whole. Most of the words are related to politics, nationalities, and people's identities.

In [7]:
words = " ".join(list(ds_insincere["question_text"]))

generate_wordcloud(words)

On the other hand, the cloud generated from the sincere questions shows almost the same words as the cloud generated from all of the questions. This makes sense given the ratio of sincere to insincere questions.

In [8]:
sincere_words = " ".join(list(ds_sincere["question_text"]))

generate_wordcloud(sincere_words)

N-Grams

To dig deeper into the differences in word usage, we built uni-, bi- and trigrams of the sincere and insincere questions. This gives a better picture of the topics discussed and the phrases used repeatedly in each type of question.

The following bar charts show the 10 most frequent words or word sequences of each n-gram for each type of question.

In [9]:
import re
!pip install nltk
import nltk
nltk.download('punkt')
nltk.download('stopwords')

def check_for_allowed_chars(text):
    pattern = re.compile("^([0-9A-Za-z_\-\+])*$")
    result = pattern.match(text)
    return result != None

def word_tokenizer(question, keep_all_words=False):
    tokens = nltk.word_tokenize(question)
    if keep_all_words:
        # keep original casing and all tokens (needed by count_titles)
        return tokens
    return [word.lower() for word in tokens if word.isalpha()]
  
def clean_words(wordsInStr):
    cleanWords=[]
    stopwords = set(nltk.corpus.stopwords.words('english'))
    for word in wordsInStr:
        if word not in stopwords and check_for_allowed_chars(word):
            cleanWords.append(word)
    return cleanWords
  
def sort_words(wordslist):
    wordslist.sort(key=lambda x: x[1],reverse = True)
    return wordslist
  
def ngram(questions, n = 1):
    questions = " ".join(list(questions))
    words = word_tokenizer(questions)
    
    cleanedWords = clean_words(words)

        
    ngrams = [" ".join(bigram) for bigram in nltk.ngrams(cleanedWords,n)]
    fdist = nltk.FreqDist(ngrams)
    
    return fdist
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.

In the following unigram chart for sincere questions, the most frequently used word is "best".

In [10]:
# Bar chart: unigrams of sincere questions

onegrams = ngram(ds_sincere["question_text"],1)    
ngr = sort_words(list(dict(onegrams).items()))
onegrampd = pd.DataFrame.from_records(ngr,columns=["Word","Count"])

top10=onegrampd[:10]
top10orden = top10.sort_values(by='Count', ascending=True)

ax = top10orden.plot.barh(x='Word', y='Count')

Comparing the following chart with the previous one, all the words change except "people", which in this case is the most common word.

In [11]:
# Bar chart: unigrams of insincere questions
onegramis = ngram(ds_insincere["question_text"],1)    
ngri = sort_words(list(dict(onegramis).items()))
onegrampdi = pd.DataFrame.from_records(ngri,columns=["Word","Count"])
top10=onegrampdi[:10]
top10orden = top10.sort_values(by='Count', ascending=True)

ax = top10orden.plot.barh(x='Word', y='Count')
In [12]:
# Bigrams of sincere questions
twograms = ngram(ds_sincere["question_text"],2)    
ngr = sort_words(list(dict(twograms).items()))
twogrampd = pd.DataFrame.from_records(ngr,columns=["Word","Count"])
top10=twogrampd[:10]
top10orden = top10.sort_values(by='Count', ascending=True)

ax = top10orden.plot.barh(x='Word', y='Count')
In [13]:
# Bigrams of insincere questions
twogramis = ngram(ds_insincere["question_text"],2)    
ngri = sort_words(list(dict(twogramis).items()))
twogrampdi = pd.DataFrame.from_records(ngri,columns=["Word","Count"])
top10=twogrampdi[:10]
top10orden = top10.sort_values(by='Count', ascending=True)

ax = top10orden.plot.barh(x='Word', y='Count')

Looking at the trigrams, we can see a clear trend in each class. The trigrams of the sincere questions indicate that most of them are asked to request advice. The insincere questions, in contrast, appear to revolve around political topics.

In [14]:
# Trigrams of sincere questions
threegrams = ngram(ds_sincere["question_text"],3)    
ngr = sort_words(list(dict(threegrams).items()))
threegrampd = pd.DataFrame.from_records(ngr,columns=["Word","Count"])
top10=threegrampd[:10]
top10orden = top10.sort_values(by='Count', ascending=True)

ax = top10orden.plot.barh(x='Word', y='Count')
In [15]:
# Trigrams of insincere questions
threegramis = ngram(ds_insincere["question_text"],3)    
ngri = sort_words(list(dict(threegramis).items()))
threegrampdi = pd.DataFrame.from_records(ngri,columns=["Word","Count"])
top10=threegrampdi[:10]
top10orden = top10.sort_values(by='Count', ascending=True)

ax = top10orden.plot.barh(x='Word', y='Count')

Meta-Features

Beyond the n-grams we can explore extra features of the questions, such as their average word length, the number of punctuation marks, the number of words, and the number of stopwords, among others.

In [0]:
def count_number_of_words(tokenized_text):
    return len(tokenized_text)

def count_number_of_unique_words(tokenized_text):
    d = set(tokenized_text)
    return len(d)

def count_stopwords(tokenized_text):
    stopwords = set(nltk.corpus.stopwords.words('english'))
    num_stopwords = 0
    for word in tokenized_text:
        if word in stopwords:
            num_stopwords += 1
    return num_stopwords

def count_non_stopwords(tokenized_text):
    num_words = count_number_of_words(tokenized_text)
    num_stopwords = count_stopwords(tokenized_text)
    return num_words - num_stopwords

def text_length(text):
    return len(text)

def mean_word_length(tokenized_text):
    if len(tokenized_text) == 0:
        return 0
    number_of_chars = 0
    for word in tokenized_text:
        number_of_chars += len(word)
    return number_of_chars * 1.0 / len(tokenized_text)

import string
def count_punctuation(text):
    punct = 0
    punctuation = set(string.punctuation)
    for char in text:
        if char in punctuation:
            punct += 1
    return punct

def count_titles(text):
    tokens = word_tokenizer(text,keep_all_words = True)
    count = 0
    for word in tokens:
        if word.istitle():
            count += 1
    return count
def generate_meta_features(df):
    dataframe = df["question_text"]
    tokenized_text = dataframe.apply(lambda x: word_tokenizer(x))
    print("Words Tokenized")
    num_words = tokenized_text.apply(lambda x: count_number_of_words(x)) 
    num_unique = tokenized_text.apply(lambda x: count_number_of_unique_words(x))
    num_stopwords = tokenized_text.apply(lambda x: count_stopwords(x))
    num_non_stopwords = tokenized_text.apply(lambda x: count_non_stopwords(x))
    length = dataframe.apply(lambda x: text_length(x))
    mean_length = tokenized_text.apply(lambda x: mean_word_length(x))
    punctuation = dataframe.apply(lambda x: count_punctuation(x))
    capitalized_words = dataframe.apply(lambda x: count_titles(x))
    
    dfr =  pd.concat([num_words, num_unique,num_stopwords,
                     num_non_stopwords,length,mean_length,
                      punctuation,capitalized_words,df["target"]], axis=1)
    dfr.columns = ["Number of Words","Number of Unique","Number of Stopwords",
                  "Number of non-Stopwords","Text length","Mean word length",
                   "Number of punctuations","Number of cap-words","target"]
    return dfr
In [17]:
mfi = generate_meta_features(ds_insincere)
mfs = generate_meta_features(ds_sincere)
Words Tokenized
Words Tokenized
In [0]:
def generate_histogram(mfs,mfi,attribute):
  num_words = pd.concat([mfs[attribute],mfi[attribute]],axis=1,ignore_index=True)
  num_words.columns = ["Sincere","Insincere"]
  plt.rcParams["figure.figsize"] = (20,10)
  ax = num_words.plot.hist(bins=12, alpha=0.5,density=True,title=attribute)

  ax.set_xlabel("Number")
In [19]:
generate_histogram(mfs,mfi,"Number of Words")
In [20]:
generate_histogram(mfs,mfi,"Number of Stopwords")
In [21]:
generate_histogram(mfs,mfi,"Number of Unique")
In [22]:
generate_histogram(mfs,mfi,"Text length")
In [23]:
generate_histogram(mfs,mfi,"Mean word length")

From the features above we can see differences between the classes in every histogram except the mean word length.

Misspelled words

If we define misspelled words as those that appear fewer than 10 times in the entire dataset, we can look at how they behave in each class.

In [24]:
bads = onegrampd[onegrampd.Count < 10]["Count"].sum()
alls = onegrampd["Count"].sum()
pers = bads*100/alls
labels = "Bien escritas","Mal escritas"
sizes = [100-pers,pers]
explode = (0.2,0)
fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
Out[24]:
[Pie chart, sincere questions: 96.3% correctly spelled words, 3.7% misspelled]
In [25]:
badi = onegrampdi[onegrampdi.Count < 10]["Count"].sum()
alli = onegrampdi["Count"].sum()
peri = badi*100/alli
labels = "Bien escritas","Mal escritas"
sizes = [100-peri,peri]
explode = (0.2,0)
fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
Out[25]:
[Pie chart, insincere questions: 90.7% correctly spelled words, 9.3% misspelled]

From the pie charts above we can conclude that insincere questions tend to contain a larger share of misspelled words.

Milestone II

Given the results of the data exploration, we decided to keep working on this topic and hypothesis. The exploration showed that the main difference between questions labeled as sincere and insincere lies in the topics discussed, as evidenced by the n-grams. We therefore focused on training several classification models using Bag of Words and n-grams. We also tried an unsupervised learning method, performing topic modeling with Latent Dirichlet Allocation (LDA).

Training Classification Models

We select features from the dataset to train and evaluate classification models such as Naive Bayes, Logistic Regression and Random Forest.

Bag of Words

Bag of Words is a text representation that describes the occurrence of words within a document. We built these bags of words from sincere and insincere questions and used them to classify the dataset with Decision Trees and Logistic Regression.
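To illustrate the representation, here is a minimal sketch (the two toy questions are hypothetical, not drawn from the dataset) of the bag of words that CountVectorizer produces: each row is a question, each column a vocabulary word, and each cell a count.

In [ ]:
from sklearn.feature_extraction.text import CountVectorizer

# Two toy questions (hypothetical examples, not taken from the dataset)
toy_questions = [
    "Why do people ask insincere questions?",
    "Why do people answer questions on Quora?",
]

toy_vectorizer = CountVectorizer()
bow = toy_vectorizer.fit_transform(toy_questions)

# Vocabulary terms in column order, followed by the count matrix
print(sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get))
print(bow.toarray())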

Given the large difference in the number of samples of each class, the dataset was balanced by subsampling the majority class (sincere questions). The subsample was obtained by picking the questions at random.

In [26]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
my_train= pd.read_csv("train.csv", header=0 ,delimiter=",")
#test = pd.read_csv("test.csv.zip")
# Balance the dataset by subsampling class 0
my_idx = np.random.choice(my_train.loc[my_train.target == 0].index, size=1144502, replace=False)
my_train = my_train.drop(my_train.iloc[my_idx].index)
print("Data subsampled on class '0'")
print(my_train['target'].value_counts())
Data subsampled on class '0'
1    80810
0    80810
Name: target, dtype: int64

Before proceeding, the data was preprocessed by removing stray single characters and special characters, so that they do not interfere with building the bags of words or with classification.
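The cleaning cell itself is not included in the notebook; a minimal sketch of what it could look like (the clean_text helper below is a hypothetical name), applied to my_train before vectorizing, is:

In [ ]:
import re

def clean_text(text):
    # Replace special characters with spaces, keeping letters, digits and whitespace
    text = re.sub(r"[^0-9A-Za-z\s]", " ", text)
    # Drop stray single characters left behind by the previous step
    text = re.sub(r"\b\w\b", " ", text)
    # Collapse repeated whitespace
    return re.sub(r"\s+", " ", text).strip()

my_train["question_text"] = my_train["question_text"].apply(clean_text)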

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import re    
from sklearn.model_selection import train_test_split
import nltk
from sklearn.model_selection import cross_validate
from nltk.corpus import stopwords
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier # KNN
from sklearn.tree import DecisionTreeClassifier # Decision tree
from sklearn.svm import SVC  # support vector machine classifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import gc
gc.enable()
X = my_train.question_text
y = my_train.target
vectorizer = CountVectorizer(analyzer='word',ngram_range = (1,2),stop_words=stopwords.words('english'))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=1, stratify=y)
In [31]:
# The 1-gram vectorizer was not defined in the original notebook; we define and fit it here on the training questions
sincere_vector_1_gram = CountVectorizer(analyzer='word', ngram_range=(1,1), stop_words=stopwords.words('english'))
sincere_vector_1_gram.fit(my_train['question_text'])

train_x, valid_x, train_y, valid_y = train_test_split(my_train['question_text'], my_train['target'], test_size=.30, random_state=37, stratify=my_train['target'])
xtrain_1_gram_sincere = sincere_vector_1_gram.transform(train_x)
xvalid_1_gram_sincere = sincere_vector_1_gram.transform(valid_x)
X_train_sincere = sincere_vector_1_gram.transform(my_train['question_text'])
scoring = ['precision_macro', 'recall_macro', 'accuracy', 'f1_macro']

We train and test the Decision Tree model using the bag of words. The models' performance is measured by their precision, recall, F1-score and accuracy.

In [33]:
# Decision Tree using the bag of words
BOW_dtc = DecisionTreeClassifier()
pipe = Pipeline(steps=[('vect',vectorizer),('clf',BOW_dtc)])
pipe.fit(X_train,y_train)
predicted = pipe.predict(X_test)
print("Classification Report:")
print(classification_report(y_test,predicted))
print("Confusion Matrix:")
print(confusion_matrix(y_test,predicted))

We proceed to do the same with Logistic Regression.

In [32]:
# Logistic Regression
log_reg = LogisticRegression(solver='liblinear',multi_class='ovr')
pipe_log = Pipeline([('vect',vectorizer),('clf',log_reg)])
pipe_log.fit(X_train,y_train)
log_predicted = pipe_log.predict(X_test)
print("Classification Report for Logistic Regression:")
print(classification_report(y_test,log_predicted))
print("Confusion Matrix:")
print(confusion_matrix(y_test,log_predicted))
In [0]:
log_reg_cv = LogisticRegression(solver='liblinear',multi_class='ovr')
pipe_log_cv = Pipeline(steps=[('vect',vectorizer),('clf',log_reg_cv)])
scoring = ['precision_macro', 'recall_macro', 'accuracy', 'f1_macro']
cv_results = cross_validate(pipe_log_cv,train['question_text'],train['target'],cv=7,scoring=scoring,return_train_score= True)
print('Promedio Precision:', np.mean(cv_results['test_precision_macro']))
print('Promedio Recall:', np.mean(cv_results['test_recall_macro']))
print('Promedio F1-score:', np.mean(cv_results['test_f1_macro']))
print('Promedio Accuracy:', np.mean(cv_results['test_accuracy']))

Topic Modeling - Latent Dirichlet Allocation

We study how the questions in the dataset behave by using LDA to find common topics within each class.

In [34]:
!pip install nltk

import pandas as pd
import matplotlib.pyplot as plt
import nltk
import numpy as np

nltk.download('punkt')
nltk.download('stopwords')

train = pd.read_csv("train.csv")
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Next we separate the two classes of questions (sincere and insincere) and subsample the sincere ones to obtain the same number of questions for each class.

Then we split each class into a train set and a test set.

In [0]:
insincere = train[train.target==1]
sincere = train[train.target==0].sample(n=len(insincere))

from sklearn.model_selection import train_test_split

X_sincere_train, X_sincere_test, y_sincere_train, y_sincere_test = train_test_split(sincere["question_text"], sincere["target"], test_size=0.20, random_state=42)
X_insincere_train, X_insincere_test, y_insincere_train, y_insincere_test = train_test_split(insincere["question_text"], insincere["target"], test_size=0.20, random_state=42)

Next we configure the TF-IDF vectorizer.

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
tfidf_vectorizer = TfidfVectorizer(strip_accents = 'unicode',
                                stop_words = stopwords.words('english'),
                                lowercase = True,
                                ngram_range=(1,4),
                                min_df = 5,)

We build a Pipeline to train the LDA model: the vectorizer above feeds its output into the LDA. We then run a GridSearchCV to find the best number of topics.

In [37]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
# Init the Model
lda = LatentDirichletAllocation(learning_method='online')
pipe_tfidf_lda = Pipeline(steps=[('tfidf', tfidf_vectorizer), ('lda', lda)])

search_params = {'lda__n_components': [10,15,20],'lda__learning_decay': [.7, .9]}



# Init Grid Search Class
model = GridSearchCV(pipe_tfidf_lda, param_grid=search_params,verbose=5)

# Do the Grid Search
model.fit(X_sincere_train)
/usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_split.py:1978: FutureWarning: The default value of cv will change from 3 to 5 in version 0.22. Specify it explicitly to silence this warning.
  warnings.warn(CV_WARNING, FutureWarning)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Fitting 3 folds for each of 6 candidates, totalling 18 fits
[CV] lda__learning_decay=0.7, lda__n_components=10 ...................
[CV]  lda__learning_decay=0.7, lda__n_components=10, score=-499161.823, total=  49.1s
[CV] lda__learning_decay=0.7, lda__n_components=10 ...................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   49.1s remaining:    0.0s
[CV]  lda__learning_decay=0.7, lda__n_components=10, score=-499365.586, total=  48.3s
[CV] lda__learning_decay=0.7, lda__n_components=10 ...................
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.6min remaining:    0.0s
[CV]  lda__learning_decay=0.7, lda__n_components=10, score=-499583.459, total=  48.3s
[CV] lda__learning_decay=0.7, lda__n_components=15 ...................
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  2.4min remaining:    0.0s
[CV]  lda__learning_decay=0.7, lda__n_components=15, score=-523003.296, total=  57.8s
[CV] lda__learning_decay=0.7, lda__n_components=15 ...................
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  3.4min remaining:    0.0s
[CV]  lda__learning_decay=0.7, lda__n_components=15, score=-522875.111, total=  57.1s
[CV] lda__learning_decay=0.7, lda__n_components=15 ...................
[CV]  lda__learning_decay=0.7, lda__n_components=15, score=-523829.388, total=  57.6s
[CV] lda__learning_decay=0.7, lda__n_components=20 ...................
[CV]  lda__learning_decay=0.7, lda__n_components=20, score=-541429.959, total= 1.1min
[CV] lda__learning_decay=0.7, lda__n_components=20 ...................
[CV]  lda__learning_decay=0.7, lda__n_components=20, score=-541829.375, total= 1.1min
[CV] lda__learning_decay=0.7, lda__n_components=20 ...................
[CV]  lda__learning_decay=0.7, lda__n_components=20, score=-542959.747, total= 1.1min
[CV] lda__learning_decay=0.9, lda__n_components=10 ...................
[CV]  lda__learning_decay=0.9, lda__n_components=10, score=-497008.805, total=  52.0s
[CV] lda__learning_decay=0.9, lda__n_components=10 ...................
[CV]  lda__learning_decay=0.9, lda__n_components=10, score=-496214.210, total=  51.3s
[CV] lda__learning_decay=0.9, lda__n_components=10 ...................
[CV]  lda__learning_decay=0.9, lda__n_components=10, score=-498682.797, total=  52.0s
[CV] lda__learning_decay=0.9, lda__n_components=15 ...................
[CV]  lda__learning_decay=0.9, lda__n_components=15, score=-520304.795, total= 1.0min
[CV] lda__learning_decay=0.9, lda__n_components=15 ...................
[CV]  lda__learning_decay=0.9, lda__n_components=15, score=-519424.779, total= 1.0min
[CV] lda__learning_decay=0.9, lda__n_components=15 ...................
[CV]  lda__learning_decay=0.9, lda__n_components=15, score=-521175.512, total= 1.0min
[CV] lda__learning_decay=0.9, lda__n_components=20 ...................
[CV]  lda__learning_decay=0.9, lda__n_components=20, score=-538138.054, total= 1.2min
[CV] lda__learning_decay=0.9, lda__n_components=20 ...................
[CV]  lda__learning_decay=0.9, lda__n_components=20, score=-537261.907, total= 1.2min
[CV] lda__learning_decay=0.9, lda__n_components=20 ...................
[CV]  lda__learning_decay=0.9, lda__n_components=20, score=-538250.105, total= 1.2min
[Parallel(n_jobs=1)]: Done  18 out of  18 | elapsed: 17.8min finished

Finally we report the best LDA model. In this case it was the model with 10 topics.

In [38]:
best_lda_model = model.best_estimator_

# Model Parameters
print("Best Model's Params: ", model.best_params_)

# Log Likelihood Score
print("Best Log Likelihood Score: ", model.best_score_)
Best Model's Params:  {'lda__learning_decay': 0.9, 'lda__n_components': 10}
Best Log Likelihood Score:  -497301.93286346557

We do the same as above, but for the insincere questions.

In [0]:
tfidf_vectorizer2 = TfidfVectorizer(strip_accents = 'unicode',
                                stop_words = stopwords.words('english'),
                                lowercase = True,
                                ngram_range=(1,4),
                                min_df = 5,)
lda2 = LatentDirichletAllocation(learning_method='online')
pipe_tfidf_lda2 = Pipeline(steps=[('tfidf', tfidf_vectorizer2), ('lda', lda2)])


# Init Grid Search Class
model = GridSearchCV(pipe_tfidf_lda2, param_grid=search_params,verbose=5)

# Do the Grid Search
model.fit(X_insincere_train)
Fitting 3 folds for each of 6 candidates, totalling 18 fits
[CV] lda__learning_decay=0.7, lda__n_components=10 ...................
/usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_split.py:1978: FutureWarning: The default value of cv will change from 3 to 5 in version 0.22. Specify it explicitly to silence this warning.
  warnings.warn(CV_WARNING, FutureWarning)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV]  lda__learning_decay=0.7, lda__n_components=10, score=-651392.423, total= 1.1min
[CV] lda__learning_decay=0.7, lda__n_components=10 ...................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  1.1min remaining:    0.0s
[CV]  lda__learning_decay=0.7, lda__n_components=10, score=-649840.106, total= 1.1min
[CV] lda__learning_decay=0.7, lda__n_components=10 ...................
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  2.2min remaining:    0.0s
[CV]  lda__learning_decay=0.7, lda__n_components=10, score=-651223.590, total= 1.1min
[CV] lda__learning_decay=0.7, lda__n_components=15 ...................
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  3.3min remaining:    0.0s

As before, the best model was the one with 10 topics.

In [0]:
best_lda_model = model.best_estimator_

# Model Parameters
print("Best Model's Params: ", model.best_params_)

# Log Likelihood Score
print("Best Log Likelihood Score: ", model.best_score_)

Now we train the LDA models that we will use later to visualize topics and to train more classifiers.

In [0]:
lda_sincere = LatentDirichletAllocation(learning_method='online',n_components=10,learning_decay=0.9)
tfidf_vectorizer_sincere = TfidfVectorizer(**tfidf_vectorizer.get_params())
sincere_pipe = Pipeline([('tfidf',tfidf_vectorizer_sincere),('lda',lda_sincere)])
sincere_pipe.fit(X_sincere_train)

lda_insincere = LatentDirichletAllocation(learning_method='online',n_components=10,learning_decay=0.9)
tfidf_vectorizer_insincere = TfidfVectorizer(**tfidf_vectorizer.get_params())
insincere_pipe = Pipeline([('tfidf',tfidf_vectorizer_insincere),('lda',lda_insincere)])
insincere_pipe.fit(X_insincere_train)

We concatenate the datasets of sincere and insincere questions.

In [0]:
X_train = np.concatenate((X_sincere_train,X_insincere_train))
y_train = np.concatenate((y_sincere_train,y_insincere_train))
X_test = np.concatenate((X_sincere_test,X_insincere_test))
y_test = np.concatenate((y_sincere_test,y_insincere_test))

We then transform the datasets using the previously trained LDA models: each dataset is passed through both LDAs and the outputs are concatenated. This acts as a feature extractor, giving us 10 features corresponding to sincere topics and 10 corresponding to insincere topics.

In [0]:
X_train_sincere_transform = sincere_pipe.transform(X_train)
X_train_insincere_transform = insincere_pipe.transform(X_train)

X_train_concatenated = np.concatenate((X_train_sincere_transform, X_train_insincere_transform), axis=1)

X_test_sincere_transform = sincere_pipe.transform(X_test)
X_test_insincere_transform = insincere_pipe.transform(X_test)

X_test_concatenated = np.concatenate((X_test_sincere_transform, X_test_insincere_transform), axis=1)

We proceed to train several classifiers and evaluate their performance.

In [0]:
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(41, weights='distance')
clf.fit(X_train_concatenated,y_train)
y_pred = clf.predict(X_test_concatenated)
print(classification_report(y_test, y_pred))
In [0]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth=5)
clf.fit(X_train_concatenated,y_train)
y_pred = clf.predict(X_test_concatenated)
print(classification_report(y_test, y_pred))
In [0]:
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
clf = RandomForestClassifier(max_depth=5, n_estimators=10)
clf.fit(X_train_concatenated,y_train)
y_pred = clf.predict(X_test_concatenated)
print(classification_report(y_test, y_pred))
In [0]:
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X_train_concatenated,y_train)
y_pred = clf.predict(X_test_concatenated)
print(classification_report(y_test, y_pred))
In [0]:
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(15, 2), random_state=1)
clf.fit(X_train_concatenated,y_train)
y_pred = clf.predict(X_test_concatenated)
print(classification_report(y_test, y_pred))

The classifiers achieve reasonable performance: most of them reach around 64% precision.

We now visualize the topics that the models found, using the pyLDAvis library.

In [0]:
!pip3 install pyldavis
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

# Sincere questions
pyLDAvis.sklearn.prepare(lda_sincere, tfidf_vectorizer_sincere.transform(X_sincere_train), tfidf_vectorizer_sincere)

We can also visualize the topics found in the insincere questions.

In [0]:
pyLDAvis.sklearn.prepare(lda_insincere, tfidf_vectorizer_insincere.transform(X_insincere_train), tfidf_vectorizer_insincere)

Generating new features for later classification

This section is based on Professor Felipe Bravo Marquez's work on affective tweets (https://affectivetweets.cms.waikato.ac.nz/). Using text-processing tools, three kinds of features are generated: n-grams (n=1,2,3,4), features derived from Bing Liu's lexicon, and features generated with Vader. With these features we run different classification experiments, combining classification algorithms and feature sets, as detailed below:

Decision Tree and Logistic Regression using n-grams (n=1,2,3,4)

First, the following libraries are imported:

In [0]:
import pandas as pd
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.sentiment.util import  mark_negation
from nltk.corpus import opinion_lexicon

from sklearn.feature_extraction.text import CountVectorizer  
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import confusion_matrix, cohen_kappa_score
import numpy as np
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_validate
from sklearn.metrics.scorer import make_scorer
from sklearn.metrics import confusion_matrix
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

Then the dataset is loaded.

In [0]:
# load training and testing datasets as a pandas dataframe
data= pd.read_csv("train.csv", header=0 ,delimiter=",",usecols=(1,2), names=("V1","Class"))
tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True)
print(data['Class'].value_counts())

Since the dataset is imbalanced, class 0 is subsampled.

In [0]:
np.random.seed(1)
idx = np.random.choice(data.loc[data.Class == 0].index, size=1144502, replace=False)
data = data.drop(data.iloc[idx].index)
print("Data subsampled on class '0'")
print(data['Class'].value_counts())

The dataset is split into training and test sets.

In [0]:
train, test = train_test_split(data, test_size=0.3)
print(train['Class'].value_counts())
print(test['Class'].value_counts())
In [0]:
X_train=train["V1"]
X_test=test["V1"]
y_train=train["Class"]
y_test=test["Class"]

# check the split
print("y_train:")
print(y_train.value_counts())

print("y_test:")
print(y_test.value_counts())

The n-grams are obtained using scikit-learn's CountVectorizer.

In [0]:
vectorizer = CountVectorizer(tokenizer = tokenizer.tokenize, preprocessor = mark_negation, ngram_range=(1,4))  

Decision Tree and Logistic Regression using n-grams (n=1,2,3,4)

First we train a Decision Tree, obtaining the following results:

In [0]:
from sklearn.metrics import classification_report
log_mod = LogisticRegression(solver='liblinear',multi_class='ovr')
clf = DecisionTreeClassifier()
text_clf = Pipeline([('vect', vectorizer), ('clf', clf)])

text_clf.fit(X_train, y_train)
predicted = text_clf.predict(X_test)

conf = confusion_matrix(y_test, predicted)
kappa = cohen_kappa_score(y_test, predicted) 
class_rep = classification_report(y_test, predicted)



print('Confusion Matrix for Decision Tree Classifier + ngram features:')
print(conf)
print('Classification Report')
print(class_rep)
print('kappa:'+str(kappa))

Results:

  • Precision: 0.77
  • Recall: 0.76
  • F1-score: 0.76
  • Kappa: 0.53

Then we train with Logistic Regression.

In [0]:
text_clf = Pipeline([('vect', vectorizer), ('clf', log_mod)])

text_clf.fit(X_train, y_train)
predicted = text_clf.predict(X_test)

conf = confusion_matrix(y_test, predicted)
kappa = cohen_kappa_score(y_test, predicted) 
class_rep = classification_report(y_test, predicted)



print('Confusion Matrix for Logistic Regression + ngram features:')
print(conf)
print('Classification Report')
print(class_rep)
print('kappa:'+str(kappa))

Results:

  • Precision: 0.84
  • Recall: 0.84
  • F1-score: 0.84
  • Kappa: 0.67

Decision Tree and Logistic Regression using n-grams (n=1,2,3,4) and Bing Liu's lexicon

We download the opinion lexicon.

In [0]:
import nltk
nltk.download('opinion_lexicon')

We extend scikit-learn's BaseEstimator and TransformerMixin classes to implement a feature generator that uses Bing Liu's lexicon.

In [0]:
class LiuFeatureExtractor(BaseEstimator, TransformerMixin):
    """Takes in a corpus of tweets and calculates features using Bing Liu's lexicon"""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.pos_set = set(opinion_lexicon.positive())
        self.neg_set = set(opinion_lexicon.negative())

    def liu_score(self,sentence):
        """Calculates the number of positive and negative words in the sentence using Bing Liu's Lexicon""" 
        tokenized_sent = self.tokenizer.tokenize(sentence)
        pos_words = 0
        neg_words = 0
        for word in tokenized_sent:
            if word in self.pos_set:
                pos_words += 1
            elif word in self.neg_set:
                neg_words += 1
        return [pos_words,neg_words]

    def transform(self, X, y=None):
        """Applies liu_score and vader_score on a data.frame containing tweets """
        values = []
        for tweet in X:
            values.append(self.liu_score(tweet))

        return(np.array(values))

    def fit(self, X, y=None):
        """This function must return `self` unless we expect the transform function to perform a 
        different action on training and testing partitions (e.g., when we calculate unigram features, 
        the dictionary is only extracted from the first batch)"""
        return self

We use the features produced from Bing Liu's lexicon together with n-grams.

We train with Logistic Regression:

In [0]:
liu_feat = LiuFeatureExtractor(tokenizer)
vectorizer = CountVectorizer(tokenizer = tokenizer.tokenize, preprocessor = mark_negation, ngram_range=(1,4))  
log_mod = LogisticRegression(solver='liblinear',multi_class='ovr')   
liu_ngram_clf = Pipeline([ ('feats', 
                            FeatureUnion([ ('ngram', vectorizer), ('liu',liu_feat) ])),
    ('clf', log_mod)])


liu_ngram_clf.fit(X_train, y_train)
pred_liu_ngram = liu_ngram_clf.predict(X_test)


conf_liu_ngram = confusion_matrix(y_test, pred_liu_ngram)
kappa_liu_ngram = cohen_kappa_score(y_test, pred_liu_ngram) 
class_rep_liu_ngram = classification_report(y_test, pred_liu_ngram)

print('Confusion Matrix for Logistic Regression + ngrams + features from Bing Liu\'s Lexicon')
print(conf_liu_ngram)
print('Classification Report')
print(class_rep_liu_ngram)
print('kappa:'+str(kappa_liu_ngram))

Results:

  • Precision: 0.84
  • Recall: 0.84
  • F1-score: 0.84
  • Kappa: 0.68

We train with a Decision Tree:

In [0]:
liu_feat = LiuFeatureExtractor(tokenizer)
liu_ngram_clf = Pipeline([ ('feats', 
                            FeatureUnion([ ('ngram', vectorizer), ('liu',liu_feat) ])),
    ('clf', clf)])


liu_ngram_clf.fit(X_train, y_train)
pred_liu_ngram = liu_ngram_clf.predict(X_test)


conf_liu_ngram = confusion_matrix(y_test, pred_liu_ngram)
kappa_liu_ngram = cohen_kappa_score(y_test, pred_liu_ngram) 
class_rep_liu_ngram = classification_report(y_test, pred_liu_ngram)

print('Confusion Matrix for Decision Tree + ngrams + features from Bing Liu\'s Lexicon')
print(conf_liu_ngram)
print('Classification Report')
print(class_rep_liu_ngram)
print('kappa:'+str(kappa_liu_ngram))

Results:

  • Precision: 0.77
  • Recall: 0.77
  • F1-score: 0.77
  • Kappa: 0.53

Decision Tree and Logistic Regression using Bing Liu's lexicon + Vader

With a process analogous to the one used for Bing Liu's lexicon, we generate features with vader_lexicon and train with Logistic Regression.

In [0]:
nltk.download('vader_lexicon')
class VaderFeatureExtractor(BaseEstimator, TransformerMixin):
    """Takes in a corpus of tweets and calculates features using the Vader method"""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.sid = SentimentIntensityAnalyzer()


    def vader_score(self,sentence):
        """ Calculates sentiment scores for a sentence using the Vader method """
        pol_scores = self.sid.polarity_scores(sentence)
        return(list(pol_scores.values()))

    def transform(self, X, y=None):
        """Applies vader_score on a data.frame containing tweets """
        values = []
        for tweet in X:
            values.append(self.vader_score(tweet))

        return(np.array(values))

    def fit(self, X, y=None):
        """Returns `self` unless something different happens in train and test"""
        return self





vader_feat = VaderFeatureExtractor(tokenizer)
liu_feat = LiuFeatureExtractor(tokenizer)

log_mod = LogisticRegression(solver='liblinear',multi_class='ovr')   
vader_liu_clf = Pipeline([ ('feats', 
                            FeatureUnion([ ('vader', vader_feat), ('liu',liu_feat) ])),
    ('clf', log_mod)])


vader_liu_clf.fit(X_train, y_train)
pred_vader_liu = vader_liu_clf.predict(X_test)


conf_vader_liu = confusion_matrix(y_test, pred_vader_liu)
kappa_vader_liu = cohen_kappa_score(y_test, pred_vader_liu) 
class_rep_vader_liu = classification_report(y_test, pred_vader_liu)

print('Confusion Matrix for Logistic Regression + Vader + features from Bing Liu\'s Lexicon')
print(conf_vader_liu)
print('Classification Report')
print(class_rep_vader_liu)
print('kappa:'+str(kappa_vader_liu))

Results:

  • Precision: 0.68
  • Recall: 0.67
  • F1-score: 0.67
  • Kappa: 0.34

Now we train with a Decision Tree.

In [0]:
vader_liu_clf = Pipeline([ ('feats', 
                            FeatureUnion([ ('vader', vader_feat), ('liu',liu_feat) ])),
    ('clf', clf)])


vader_liu_clf.fit(X_train, y_train)
pred_vader_liu = vader_liu_clf.predict(X_test)


conf_vader_liu = confusion_matrix(y_test, pred_vader_liu)
kappa_vader_liu = cohen_kappa_score(y_test, pred_vader_liu) 
class_rep_vader_liu = classification_report(y_test, pred_vader_liu)

print('Confusion Matrix for DecisionTree + Vader + features from Bing Liu\'s Lexicon')
print(conf_vader_liu)
print('Classification Report')
print(class_rep_vader_liu)
print('kappa:'+str(kappa_vader_liu))

Results:

  • Precision: 0.64
  • Recall: 0.64
  • F1-score: 0.63
  • Kappa: 0.27

Decision Tree and Logistic Regression using n-grams and Bing Liu's lexicon + Vader

We use the three feature sets shown above and train with Logistic Regression:

In [0]:
ngram_lex_clf = Pipeline([ ('feats', 
                            FeatureUnion([ ('ngram', vectorizer), ('vader',vader_feat),('liu',liu_feat)  ])),
    ('clf', log_mod)])


ngram_lex_clf.fit(X_train, y_train)
pred_ngram_lex = ngram_lex_clf.predict(X_test)


conf_ngram_lex = confusion_matrix(y_test, pred_ngram_lex)
kappa_ngram_lex = cohen_kappa_score(y_test, pred_ngram_lex) 
class_rep = classification_report(y_test, pred_ngram_lex)


print('Confusion Matrix for Logistic Regression + ngrams + features from Bing Liu\'s Lexicon and the Vader method')
print(conf_ngram_lex)
print('Classification Report')
print(class_rep)
print('kappa:'+str(kappa_ngram_lex))

Results:

  • Precision: 0.84
  • Recall: 0.84
  • F1-score: 0.84
  • Kappa: 0.68

Now the same is done with a Decision Tree.

In [0]:
ngram_lex_clf = Pipeline([ ('feats', 
                            FeatureUnion([ ('ngram', vectorizer), ('vader',vader_feat),('liu',liu_feat)  ])),
    ('clf', clf)])


ngram_lex_clf.fit(X_train, y_train)
pred_ngram_lex = ngram_lex_clf.predict(X_test)


conf_ngram_lex = confusion_matrix(y_test, pred_ngram_lex)
kappa_ngram_lex = cohen_kappa_score(y_test, pred_ngram_lex) 
class_rep = classification_report(y_test, pred_ngram_lex)


print('Confusion Matrix for DecisionTree + ngrams + features from Bing Liu\'s Lexicon and the Vader method')
print(conf_ngram_lex)
print('Classification Report')
print(class_rep)
print('kappa:'+str(kappa_ngram_lex))

Results:

  • Precision: 0.76
  • Recall: 0.76
  • F1-score: 0.76
  • Kappa: 0.53

k-fold cross-validation (k=6) with n-grams (n=1,2,3,4)

In [0]:
import pandas as pd       
from nltk.tokenize import TweetTokenizer
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.sentiment.util import  mark_negation
from nltk.corpus import opinion_lexicon

from sklearn.feature_extraction.text import CountVectorizer  
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import confusion_matrix, cohen_kappa_score
import numpy as np
from time import time
In [0]:
tiempo_inicial = time() 
In [0]:
# load training and testing datasets as a pandas dataframe
data= pd.read_csv("train.csv", header=0 ,delimiter=",",usecols=(1,2), names=("V1","Class"))
tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True)
In [0]:
print("Distribucion de clases original")
data['Class'].value_counts()
In [0]:
# Subsample class 0 (and part of class 1) to obtain a balanced sample of 41,620 questions
idx = np.random.choice(data.loc[data.Class == 0].index, size=1204502, replace=False)
data1 = data.drop(idx)
idy = np.random.choice(data1.loc[data1.Class == 1].index, size=60000, replace=False)
data = data1.drop(idy)
print("Data subsampled on class '0'")
print(data['Class'].value_counts())
In [0]:
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_validate
from sklearn.metrics.scorer import make_scorer
from sklearn.metrics import confusion_matrix
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.decomposition import PCA

vectorizer = CountVectorizer(tokenizer = tokenizer.tokenize, preprocessor = mark_negation, ngram_range=(1,4))  
print (vectorizer)
print(data[:5])

X = vectorizer.fit_transform(data['V1'])
Y=data.Class
In [0]:
scoring = ['precision_macro', 'recall_macro', 'accuracy', 'f1_macro']
clf= DecisionTreeClassifier()
cv_results = cross_validate(clf, X, Y, cv=6,  scoring = scoring, return_train_score= True)

print('Promedio Precision:', np.mean(cv_results['test_precision_macro']))
print('Promedio Recall:', np.mean(cv_results['test_recall_macro']))
print('Promedio F1-score:', np.mean(cv_results['test_f1_macro']))
print('Promedio Accuracy:', np.mean(cv_results['test_accuracy']))

Results:

  • Average precision: 0.7862912240137385
  • Average recall: 0.7860405598263701
  • Average F1-score: 0.7859922603983271
  • Average accuracy: 0.7860405598263701

Analysis of Results

The following table summarizes the results obtained by the classifiers in each case.

Classifier          | Features                            | Kappa Score | F1 Score
Decision Tree       | n-grams (n=1,2,3,4)                 | 0.53        | 0.76
Logistic Regression | n-grams (n=1,2,3,4)                 | 0.67        | 0.84
Decision Tree       | n-grams + Bing Liu lexicon          | 0.54        | 0.77
Logistic Regression | n-grams + Bing Liu lexicon          | 0.68        | 0.84
Decision Tree       | Bing Liu lexicon + Vader            | 0.27        | 0.63
Logistic Regression | Bing Liu lexicon + Vader            | 0.34        | 0.67
Decision Tree       | n-grams + Bing Liu lexicon + Vader  | 0.52        | 0.76
Logistic Regression | n-grams + Bing Liu lexicon + Vader  | 0.68        | 0.84

All models outperform the baseline of classifying questions at random. Logistic regression obtains the highest F1-score for every feature set, reaching 0.84 when classifying with n-grams. When classifying with the lexicon features alone, both classifiers get worse in both Kappa score and F1 score.
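As a sanity check, the random baseline can be computed with scikit-learn's DummyClassifier; the following is a minimal sketch, assuming the vectorized X and Y from the cross-validation cell above.

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate
import numpy as np

# Random (uniform) baseline over the same balanced, vectorized data used above
baseline = DummyClassifier(strategy='uniform', random_state=0)
baseline_results = cross_validate(baseline, X, Y, cv=6, scoring=['f1_macro', 'accuracy'])
print('Baseline F1-score:', np.mean(baseline_results['test_f1_macro']))
print('Baseline Accuracy:', np.mean(baseline_results['test_accuracy']))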

We examined the coefficients of the logistic regression to see how weight is assigned to the different structures.

Weight Feature
+3.534 incest
+3.317 muslims
+3.260 castrated
+3.254 rape
+3.123 jews
+2.921 fuck
+2.885 atheists
+2.866 indians
+2.860 democrats
+2.848 castrate
+2.819 dick
+2.782 gay
+2.715 liberals
+2.670 modi
+2.576 homosexual
+2.555 muslim
+2.555 penis
+2.544 trump
+2.534 raped
+2.456 jew
...980971 more

We can see that some words receive considerable weight even though their presence in a question does not necessarily make it insincere. This happens because these words appear frequently in the insincere questions of the training set.
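These weights can also be inspected directly from the fitted model, without eli5; a minimal sketch, assuming the `vectorizer` and `log_mod` objects fitted in the code section below (and an older scikit-learn where `get_feature_names()` is available):

import numpy as np

# Top positive coefficients of the logistic regression over n-gram features
feature_names = np.array(vectorizer.get_feature_names())
weights = log_mod.coef_[0]
top = np.argsort(weights)[::-1][:20]
for name, w in zip(feature_names[top], weights[top]):
    print('{:+.3f}  {}'.format(w, name))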

The Trump Case

We decided to study the word "Trump" because it is recurrent in the dataset, mainly in insincere questions, and it is assigned a large weight when classifying. We wanted to analyze how the previous classifiers behave when a question contains the word "Trump".

We chose 5 questions to classify:

  • "Why is Donald Trump such an idiot?" (Insincere)
  • "Who is Trump?" (Sincere)
  • "Is Trump a good president?" (Sincere)
  • "Why Trump took harsh actions against Huawei" (Sincere)
  • "What are 10 worst economic indicators of Trump government" (Sincere)

Of these questions, only the first meets the criteria to be considered insincere.

We compared the results of the logistic regression model trained with lexicons and n-grams against the one trained with lexicons only.
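A minimal sketch of how this comparison can be run, assuming the fitted pipelines `liu_ngram_clf` (n-grams + lexicons) and `vader_liu_clf` (lexicons only) defined in the code section below; the list name is ours:

trump_questions_sample = ["Why is Donald Trump such an idiot?",
                          "Who is Trump?",
                          "Is Trump a good president?",
                          "Why Trump took harsh actions against Huawei",
                          "What are 10 worst economic indicators of Trump government"]

# 1 = predicted insincere, 0 = predicted sincere
print(liu_ngram_clf.predict(trump_questions_sample))   # n-grams + lexicon features
print(vader_liu_clf.predict(trump_questions_sample))   # lexicon features only (Vader + Bing Liu)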

n-grams and lexicons

The model trained with n-grams and lexicons classified all 5 questions as insincere, producing 4 false positives and one true positive.

Predicted as insincere:

  • "Why is Donald Trump such an idiot?" (Insincere)
  • "Who is Trump?" (Sincere)
  • "Is Trump a good president?" (Sincere)
  • "Why Trump took harsh actions against Huawei" (Sincere)
  • "What are 10 worst economic indicators of Trump government" (Sincere)

lexicons

Using the model trained only with lexicons, 1 insincere and 2 sincere questions were classified correctly, producing 2 false positives.

Predicted as insincere:

  • "Why is Donald Trump such an idiot?" (Insincere)
  • "Why Trump took harsh actions against Huawei" (Sincere)
  • "What are 10 worst economic indicators of Trump government" (Sincere)

Predicted as sincere:

  • "Who is Trump?" (Sincere)
  • "Is Trump a good president?" (Sincere)

We can conclude that reducing the weight assigned to this word yields better results in terms of false positives.

Code for the results analysis

We import the libraries needed for this section.

In [0]:
import pandas as pd
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.sentiment.util import  mark_negation
from nltk.corpus import opinion_lexicon

from sklearn.feature_extraction.text import CountVectorizer  
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import confusion_matrix, cohen_kappa_score
import numpy as np
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_validate
from sklearn.metrics.scorer import make_scorer
from sklearn.metrics import confusion_matrix
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
In [0]:
# load training and testing datasets as a pandas dataframe
data= pd.read_csv("train.csv", header=0 ,delimiter=",",usecols=(1,2), names=("V1","Class"))
tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True)
print(data['Class'].value_counts())
In [0]:
#np.random.seed(1)

# subsample class 0 so that both classes have 80,810 questions
idx = np.random.choice(data.loc[data.Class == 0].index, size=1144502, replace=False)
data = data.drop(idx)
print("Data subsampled on class '0'")
print(data['Class'].value_counts())
In [0]:
train, test = train_test_split(data, test_size=0.3)
print(train['Class'].value_counts())
print(test['Class'].value_counts())
In [0]:
X_train=train["V1"]
X_test=test["V1"]
y_train=train["Class"]
y_test=test["Class"]

# check the class balance of the split
print("y_train:")
print(y_train.value_counts())

print("y_test:")
print(y_test.value_counts())
In [0]:
# small hand-made question set to sanity-check assumptions
x_simple=["Who is Trump?","Why trump took harsh sanctions against huawei?","What are the top 10 worst economic indicators of the trump government?","why donald trump is such an idiot?","Is trump a good president?","What are the main beliefs of the Jews?","Which is the country with the largest population of atheists?"]
y_simple=[0,0,0,1,0,0,0]
In [0]:
vectorizer = CountVectorizer(tokenizer = tokenizer.tokenize, preprocessor = mark_negation, ngram_range=(1,3))  
In [0]:
def compare(X,y,pred):
    T=['' for x in range(pred.size)];
    for i in range(pred.size):
        if (y.values[i]==pred[i] and pred[i]==1):
            T[i]='TP'
        elif (y.values[i]==pred[i] and pred[i]==0):
            T[i]='TN'
        elif (y.values[i]!=pred[i] and pred[i]==0):
            T[i]='FN'
        elif (y.values[i]!=pred[i] and pred[i]==1):
            T[i]='FP'
    dataset = pd.DataFrame({'V1':X.values,'Class':y.values,'Pred':pred,'Type':T});
    return dataset
In [0]:
def compare2(x,y,pred):
    T=['' for x in range(pred.size)];
    for i in range(pred.size):
        if (y[i]==pred[i] and pred[i]==1):
            T[i]='TP'
        elif (y[i]==pred[i] and pred[i]==0):
            T[i]='TN'
        elif (y[i]!=pred[i] and pred[i]==0):
            T[i]='FN'
        elif (y[i]!=pred[i] and pred[i]==1):
            T[i]='FP'
    dataset = pd.DataFrame({'V1':x,'Class':y,'Pred':pred,'Type':T});
    return dataset
In [0]:
# logistic regression over n-gram features
log_mod = LogisticRegression(solver='liblinear',multi_class='ovr')
text_clf = Pipeline([('vect', vectorizer), ('clf', log_mod)])

text_clf.fit(X_train, y_train)
predicted = text_clf.predict(X_test)
conf = confusion_matrix(y_test, predicted)
kappa = cohen_kappa_score(y_test, predicted) 
class_rep = classification_report(y_test, predicted)

# classify the hand-made test questions
predicted_simple_1= text_clf.predict(x_simple)



print('Confusion Matrix for Logistic Regression + ngram features:')
print(conf)
print('Classification Report')
print(class_rep)
print('kappa:'+str(kappa))
In [0]:
!pip install eli5
In [0]:
# inspect the weights learned by the regression
import eli5
eli5.show_weights(log_mod, vec=vectorizer, top=50 )
In [0]:
result1_simple=compare2(x_simple,y_simple,predicted_simple_1)
print(result1_simple)
In [0]:
nltk.download('opinion_lexicon')
In [0]:
class LiuFeatureExtractor(BaseEstimator, TransformerMixin):
    """Takes in a corpus of tweets and calculates features using Bing Liu's lexicon"""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.pos_set = set(opinion_lexicon.positive())
        self.neg_set = set(opinion_lexicon.negative())

    def liu_score(self,sentence):
        """Calculates the number of positive and negative words in the sentence using Bing Liu's Lexicon""" 
        tokenized_sent = self.tokenizer.tokenize(sentence)
        pos_words = 0
        neg_words = 0
        for word in tokenized_sent:
            if word in self.pos_set:
                pos_words += 1
            elif word in self.neg_set:
                neg_words += 1
        return [pos_words,neg_words]

    def transform(self, X, y=None):
        """Applies liu_score to an iterable of questions"""
        values = []
        for question in X:
            values.append(self.liu_score(question))

        return(np.array(values))

    def fit(self, X, y=None):
        """This function must return `self` unless we expect the transform function to perform a 
        different action on training and testing partitions (e.g., when we calculate unigram features, 
        the dictionary is only extracted from the first batch)"""
        return self
In [0]:
liu_feat = LiuFeatureExtractor(tokenizer)
vectorizer = CountVectorizer(tokenizer = tokenizer.tokenize, preprocessor = mark_negation, ngram_range=(1,4))  
log_mod = LogisticRegression(solver='liblinear',multi_class='ovr')   
liu_ngram_clf = Pipeline([ ('feats', 
                            FeatureUnion([ ('ngram', vectorizer), ('liu',liu_feat) ])),
    ('clf', log_mod)])


liu_ngram_clf.fit(X_train, y_train)
pred_liu_ngram = liu_ngram_clf.predict(X_test)


conf_liu_ngram = confusion_matrix(y_test, pred_liu_ngram)
kappa_liu_ngram = cohen_kappa_score(y_test, pred_liu_ngram) 
class_rep_liu_ngram = classification_report(y_test, pred_liu_ngram)

# classify the hand-made test questions with the n-gram + lexicon model
predicted_simple_2 = liu_ngram_clf.predict(x_simple)

print('Confusion Matrix for Logistic Regression + ngrams + features from Bing Liu\'s Lexicon')
print(conf_liu_ngram)
print('Classification Report')
print(class_rep_liu_ngram)
print('kappa:'+str(kappa_liu_ngram))
In [0]:
result2_simple=compare2(x_simple,y_simple,predicted_simple_2)
print("pruebas para Logistic Regression + ngrams + features from Bing Liu's Lexicon")
print(result2_simple)
In [0]:
nltk.download('vader_lexicon')
class VaderFeatureExtractor(BaseEstimator, TransformerMixin):
    """Takes in a corpus of tweets and calculates features using the Vader method"""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.sid = SentimentIntensityAnalyzer()


    def vader_score(self,sentence):
        """ Calculates sentiment scores for a sentence using the Vader method """
        pol_scores = self.sid.polarity_scores(sentence)
        return(list(pol_scores.values()))

    def transform(self, X, y=None):
        """Applies vader_score on a data.frame containing tweets """
        values = []
        for tweet in X:
            values.append(self.vader_score(tweet))

        return(np.array(values))

    def fit(self, X, y=None):
        """Returns `self` unless something different happens in train and test"""
        return self
In [0]:
vader_feat = VaderFeatureExtractor(tokenizer)
liu_feat = LiuFeatureExtractor(tokenizer)

log_mod = LogisticRegression(solver='liblinear',multi_class='ovr')   
vader_liu_clf = Pipeline([ ('feats', 
                            FeatureUnion([ ('vader', vader_feat), ('liu',liu_feat) ])),
    ('clf', log_mod)])


vader_liu_clf.fit(X_train, y_train)
pred_vader_liu = vader_liu_clf.predict(X_test)


conf_vader_liu = confusion_matrix(y_test, pred_vader_liu)
kappa_vader_liu = cohen_kappa_score(y_test, pred_vader_liu) 
class_rep_vader_liu = classification_report(y_test, pred_vader_liu)

# classify the hand-made test questions with the lexicon-only model
predicted_simple_3= vader_liu_clf.predict(x_simple)

print('Confusion Matrix for Logistic Regression + Vader + features from Bing Liu\'s Lexicon')
print(conf_vader_liu)
print('Classification Report')
print(class_rep_vader_liu)
print('kappa:'+str(kappa_vader_liu))
In [0]:
result3_simple=compare2(x_simple,y_simple,predicted_simple_3)
print(result3_simple)
In [0]:
from sklearn.feature_extraction import text 
delete={"trump","white","muslims","atheists","jew"}
stop_words = text.ENGLISH_STOP_WORDS.union(delete)
print("trump"in stop_words and "white"in stop_words and "muslims"in stop_words and  "atheists"in stop_words and  "jew"in stop_words)
In [0]:
# logistic regression over n-grams, with some heavily weighted words added to the stop-word list
log_mod = LogisticRegression(solver='liblinear',multi_class='ovr')
stop_vectorizer = CountVectorizer(tokenizer = tokenizer.tokenize, preprocessor = mark_negation, ngram_range=(1,3), stop_words=stop_words)
text_clf = Pipeline([('vect', stop_vectorizer), ('clf', log_mod)])

text_clf.fit(X_train, y_train)
predicted = text_clf.predict(X_test)
conf = confusion_matrix(y_test, predicted)
kappa = cohen_kappa_score(y_test, predicted) 
class_rep = classification_report(y_test, predicted)

#predecimos la prueba para supuestos
predicted_simple_4= text_clf.predict(x_simple)



print('Confusion Matrix for Logistic Regression + ngram features:')
print(conf)
print('Classification Report')
print(class_rep)
print('kappa:'+str(kappa))
In [0]:
# results of the simple test with the words added to the stop-word list
result4_simple=compare2(x_simple,y_simple,predicted_simple_4)
print(result4_simple)

RNN (LSTM) Model

We will train a deep learning model; for this we use PyTorch and its LSTM layer.

First we write the functions to load and preprocess the questions.

In [0]:
import numpy as np
from tqdm import tqdm
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence
import pandas as pd
import re

pattern = r"\w+|[^\w\s]"

EMBEDDING_FILE = './glove.840B.300d/glove.840B.300d.txt'
DATA_FILE = 'train.csv'
def questions_to_indices(questions,embeddings_index):
  indices = []
  for question in questions:
    tokens = re.findall(pattern,question)
    ind = torch.LongTensor([embeddings_index[tok.lower()] if tok.lower() in embeddings_index else 0 for tok in tokens]).view(-1,1)
    indices.append(ind)
  return pad_sequence(indices,batch_first=True)[:,:30]

def create_embeddings_index(embeddings_dict):
  embeddings_index = {key:i+1 for i,key in tqdm(enumerate(embeddings_dict))}
  embeddings_list = ['' for i in range(len(embeddings_index)+1)]

  embeddings_list[0] = torch.zeros(1,300)

  for key,value in tqdm(embeddings_index.items()):
    embeddings_list[value] = embeddings_dict[key].view(1,-1)
  embeddings_list = torch.stack(embeddings_list).view(-1,300)
  return embeddings_index, nn.Embedding.from_pretrained(embeddings_list,padding_idx=0)

def create_embedding(EMBEDDING_FILE,DATA_FILE):
  train = pd.read_csv(DATA_FILE)

  # balance the classes and soften the labels: insincere -> 0.8, sincere -> 0.2
  insincere = train[train.target==1]
  insincere.loc[:,"target"] = 0.8 * insincere["target"]
  #insincere = pd.concat([insincere]*2, ignore_index=True)
  sincere = train[train.target==0].sample(n=len(insincere))
  sincere.loc[:,"target"] = 0.2 + sincere["target"]
  data = pd.concat([insincere,sincere])
  
  def get_coefs(word,*arr): 
    return word, torch.from_numpy(np.asarray(arr, dtype='float32'))
  print("\nReading embedding file")
  embeddings_dict = dict(get_coefs(*o.split(" ")) for o in tqdm(open(EMBEDDING_FILE)))
  print("\n Done")
  print("Generating embedding")
  embeddings_index, embedding = create_embeddings_index(embeddings_dict)
  print("\n Done")
  print("Transforming Data")
  questions = questions_to_indices(data["question_text"],embeddings_index)
  print("\n Done")
  return embedding, embeddings_index, (questions.numpy(),np.asarray(data["target"]))
In [0]:
embedding, embedding_index, dataset = create_embedding(EMBEDDING_FILE,DATA_FILE)

Next we generate the train, validation, and test sets, with 80%, 10%, and 10% of the data respectively.

In [0]:
from sklearn.model_selection import train_test_split
x,y = dataset
print(x.shape)
print(y.shape)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=2019)
x_test, x_valid, y_test, y_valid = train_test_split(x_test, y_test, test_size=0.5, random_state=2019)

We then create a DataLoader for each set; this lets us shuffle the data and iterate over it in mini-batches of size 32.

In [0]:
import torch
from torch.utils.data import DataLoader, TensorDataset

train_data = TensorDataset(torch.from_numpy(x_train).squeeze(), torch.from_numpy(y_train))
valid_data = TensorDataset(torch.from_numpy(x_valid).squeeze(), torch.from_numpy(y_valid))
test_data = TensorDataset(torch.from_numpy(x_test).squeeze(), torch.from_numpy(y_test))

batch_size = 32

train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size, drop_last=True)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size, drop_last=True)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size, drop_last=True)

Let's look at a sample mini-batch.

In [0]:
# Get one mini-batch from the loader
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)
print('Sample input size: ', sample_x.size()) # batch_size, seq_length
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size()) # batch_size
print('Sample label: \n', sample_y)

Model implementation

Next we implement the model. The words first go through the embedding, then through the LSTM layer, then a fully connected layer of 10 neurons, and finally a single output neuron that indicates whether the question is sincere or insincere.

In [0]:
class QuoraLSTM(nn.Module):

    def __init__(self, embedding, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0):
       
      super().__init__()

      self.output_size = output_size
      self.n_layers = n_layers
      self.hidden_dim = hidden_dim
      
      self.embedding = embedding
      self.batchNorm = nn.BatchNorm1d(embedding_dim)
      self.dropout = nn.Dropout(drop_prob)
      self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, 
                          dropout=drop_prob,bidirectional=True)

      self.batchNorm2 = nn.BatchNorm1d(hidden_dim)

      self.fc = nn.Linear(hidden_dim, 10)
      self.relu = nn.ReLU()
      self.dropout2 = nn.Dropout(drop_prob)
      self.fc2 = nn.Linear(10,output_size)
      self.sig = nn.Sigmoid()
     
      
    def forward(self, x):
        
        batch_size = x.size(0)
        
        # look up the pretrained embeddings, normalize and regularize them
        embeds = self.embedding(x)
        embeds = self.batchNorm(embeds.transpose(1,2)).transpose(1,2)
        embeds = self.dropout(embeds).transpose(0,1)
        lstm_out, hidden = self.lstm(embeds)

        # flatten the LSTM outputs so each hidden state goes through the dense layers
        lstm_out = lstm_out.transpose(0,1).contiguous().view(-1, self.hidden_dim)

        out = self.batchNorm2(lstm_out)
        out = self.fc(out)
        out = self.relu(out)
        out = self.dropout2(out)
        out = self.fc2(out)
        sig_out = self.sig(out)
        
        # keep only the sigmoid output of the last position for each question
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1] 
        return sig_out

We can inspect the size of the embedding built earlier; note that there are 2,196,017 distinct embedding vectors in our vocabulary (GloVe).

In [0]:
print(embedding)

We instantiate our model, passing it the pretrained embedding, the embedding dimension (300 in this case), the hidden size, and the number of layers of the network.

In [0]:
vocab_size = 2196017
output_size = 1
embedding_dim = 300
hidden_dim = 64
n_layers = 3
net = QuoraLSTM(embedding, output_size, embedding_dim, hidden_dim, n_layers,drop_prob=0.5)

print(net)

If we wish, we can use torchviz to visualize the model we created.

In [0]:
!pip install torchviz
from torchviz import make_dot

make_dot(net.cuda()(sample_x.cuda()), params=dict(list(net.named_parameters()) + [('x', sample_x.cuda())]))

Training

Finally, we write the loop that trains the network and evaluates it on the validation set.

In [0]:
#Learning rate
lr=0.001

#Loss function
criterion = nn.BCELoss()
#Optimizer that adjust the weights of the net in every iteration
optimizer = torch.optim.Adam(net.parameters(), lr=lr,amsgrad=True)

#setup training on gpu (this is a lot faster!)
train_on_gpu = True

#number of epochs, meaning the number of times the net is going to see the training data
epochs = 4

counter = 0
print_every = 200
clip=5 # gradient clipping

#save train and val losses
train_loss = []
valid_loss = []
# move model to GPU, if available
if(train_on_gpu):
    net.cuda()

net.train()

#training loop
for e in range(epochs):
    train_losses = []
    for inputs, labels in train_loader:
        counter += 1
       
        inputs = inputs.type(torch.LongTensor)
        
        if(train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()

       
        # zero accumulated gradients
        net.zero_grad()

        # get the output from the model

        output = net(inputs)

        # calculate the loss and perform backprop
        loss = criterion(output.squeeze(), labels.float())
        train_losses.append(loss.item())
        
        loss.backward()
        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()

        # loss stats
        if counter % print_every == 0:
            correct = 0
            total = 0
            # Get validation loss

            val_losses = []
            net.eval()
            for inputs, labels in valid_loader:

                inputs = inputs.type(torch.LongTensor)

                if(train_on_gpu):
                    inputs, labels = inputs.cuda(), labels.cuda()

                output = net(inputs)
                val_loss = criterion(output.squeeze(), labels.float())
                val_losses.append(val_loss.item())
                # convert output probabilities to predicted class (0 or 1)
                pred = torch.round(output.squeeze()) # rounds to the nearest integer
                labels = torch.round(labels)
                if train_on_gpu:
                  pred = pred.cpu()
                  labels = labels.cpu()
                
                correct += (pred.long() == labels.long()).sum().item()
                total += labels.size(0)
            
            accuracy = 100 * correct / total
            net.train()
            train_loss.append(np.mean(train_losses))
            valid_loss.append(np.mean(val_losses))
            train_losses = []
            print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Accuracy: {:.6f}...".format(accuracy),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))

After training the network, we evaluate it on the test set.

In [0]:
test_losses = [] # track loss
predicted = []
ground_truth = []

net.eval()
# iterate over test data
for inputs, labels in test_loader:

    inputs = inputs.type(torch.LongTensor)

    if(train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()
    
    # get predicted outputs

    output = net(inputs)
    
    # calculate loss
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze())  # rounds to the nearest integer
    labels = torch.round(labels)
    if train_on_gpu:
      pred = pred.cpu()
      labels = labels.cpu()
    
    predicted.append(pred.detach().numpy())
    ground_truth.append(labels.detach().numpy())
    
predicted = np.concatenate(predicted)
ground_truth = np.concatenate(ground_truth)

We plot the loss during training.

In [0]:
import matplotlib.pyplot as plt
timesteps = [print_every*i for i in range(1,len(train_loss)+1)]
plt.figure(figsize=(20,10))
plt.plot(timesteps,train_loss,label='train loss')
plt.plot(timesteps,valid_loss,label='valid loss')
plt.xlabel('Step')
plt.ylabel('Loss')
plt.title('Loss of LSTM during training')
plt.legend()
#plt.legend(['train loss','valid loss'])
plt.show()

We can also generate the classification metrics with the following code.

In [0]:
from sklearn.metrics import classification_report
target_names = ['sincere', 'insincere']
print(classification_report(ground_truth, predicted, target_names=target_names))
In [0]:
import matplotlib.pyplot as plt

from sklearn.metrics import confusion_matrix

#this function is from scikit website.
def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax

We plot the confusion matrix.

In [0]:
plot_confusion_matrix(ground_truth, predicted, classes=["Sincere","Insincere"], normalize=True,
                      title='RNN Confusion Matrix')
plt.show()

We can create a helper function that classifies new questions using the trained model.

In [0]:
def classify_questions(questions):
  indices = questions_to_indices(questions,embedding_index)
  if train_on_gpu:
    indices = indices.cuda()
 
 
  output = net(indices.view(len(questions),-1)).detach()
  if train_on_gpu:
    output = output.cpu()

  return torch.round(output).long().numpy()

def label_to_classname(labels):
  return ["Insincere" if label else "Sincere" for label in labels]

Next we classify a sample of questions using the trained network.

In [0]:
q = classify_questions(["If an attorney knows for certain that their client is guilty and proceeds to prove their innocence, does this make the attorney complicit in the crime?",
                        "How did Ecuador become so trendy?",
                        "What does the abbreviation 'AKA' mean in movie titles?",
                        "Who is Donald Trump?",
                        "Is the sun yellow or white?"
                       ])
print(label_to_classname(q))

Something interesting to examine is the precision on questions that mention Trump, so with the following function we extract all the questions that contain that word.

In [0]:
def get_trump_dataset(DATA_FILE):
  dataframe = pd.read_csv(DATA_FILE)
  dataframe['question_text']= dataframe['question_text'].map(lambda s:s.lower() if type(s) == str else s)
  trump_dataset = dataframe[dataframe['question_text'].str.contains('trump')]
  return trump_dataset["question_text"],np.around(np.asarray(trump_dataset["target"]))

trump_questions,trump_labels = get_trump_dataset(DATA_FILE)

After this, we classify those questions and display their confusion matrix.

In [0]:
tq = classify_questions(trump_questions)

plot_confusion_matrix(trump_labels, tq, classes=["Sincere","Insincere"], normalize=True,
                      title='RNN Confusion Matrix - Questions containing Trump')
plt.show()
In [0]:
count_insincere_trump = 0
for label in trump_labels:
  if label:
    count_insincere_trump += 1
count_sincere_trump = len(trump_labels)-count_insincere_trump
print("Sincere trump count",count_sincere_trump)
print("Insincere trump count",count_insincere_trump)

Conclusions of the Work

We can conclude that patterns can indeed be observed between the two types of questions, but the weight of the individual words within a question must also be considered in order to classify effectively.

When only the patterns are used, the models tend to produce a large number of false positives. Models could instead be built around the weight of the words rather than the patterns in the questions, making them more generalizable, since relying on structure leaves the model highly sensitive to the content of the dataset used for training and testing.

Future work for upcoming groups

It would be interesting to study the questions based on the word weights mentioned above, as well as to use models that rely on natural language processing. It would also be interesting to analyze the linguistic structure of each question to determine the intent of the sentence and what it seeks to provoke within the Quora community.
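As one possible starting point for such word-weight based models (our suggestion, not something explored above), the raw n-gram counts could be replaced by TF-IDF weights; a minimal sketch:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical pipeline: words weighted by TF-IDF instead of raw counts
tfidf_clf = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=True, ngram_range=(1, 1))),
    ('clf', LogisticRegression(solver='liblinear')),
])
# tfidf_clf.fit(X_train, y_train) and its evaluation would mirror the models above.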